Graph Analytics

There are some problems which don't neatly fall into the category of traditional "machine learning" but can still be quite powerful. The problem we will look at today involves relationships between things which can be organized as graphs, and the tool we will use to investigate it is called graph analytics or network analysis.

The "Karate Club" Network Graph.

What is a "Network Graph"

A Network graph is a mathematical (and usually visual) structure designed to show the relations (called edges) between points (called nodes) in an aesthetically-pleasing way. The graph visualizes how subjects are interconnected with each other. Entities are displayed as nodes and the edges connecting them are displayed with lines. Python has a number of modules designed to work with network graphs, and one of the best is called networkx.

There are a variety of interesting analyses that can be performed on a network, but one of the more interesting ones involves looking for "clustering" of subsets of nodes. This clustering is called "community detection".

Zachary's Karate Club is a small example network included with networkx which can be used to test a number of features of graph analytics, including community detection. See this (https://en.wikipedia.org/wiki/Zachary%27s_karate_club) for more details.

From Wikipedia:

A social network of a karate club was studied by Wayne W. Zachary for a period of three years from 1970 to 1972.[2] The network captures 34 members of a karate club, documenting links between pairs of members who interacted outside the club. During the study a conflict arose between the administrator "Officer" and instructor "Mr. Hi" (pseudonyms), which led to the split of the club into two. Half of the members formed a new club around Mr. Hi; members from the other part found a new instructor or gave up karate. Based on collected data Zachary correctly assigned all but one member of the club to the groups they actually joined after the split.

Let's get the Karate Club network, and print out information about the nodes and edges. We will see that the nodes have only one attribute (the club they ended up with) and the edges have no attribute, other than connecting two nodes.

Nodes could have many attributes, such as a label (in this case possibly the name of the student).

Edges could also have attributes, such as the strength of the connection between the two nodes (also called the weight).

Some helper functions

These will be useful for plotting and layout of our graphs.

Community Detection

Once a network is made, a simple way to look for structure is to see if subsets of the nodes cluster together.

One of the best tools for this is "louvain community detection" (https://perso.uclouvain.be/vincent.blondel/research/louvain.html).

There is a resollution parameter which controls how many clusters are typically found. The default is 1.0. Smaller numbers yield more clusters, while larger numbers yield fewer clusters.

A more fun network: Marvel Comic Book Characters!

marvel

This network is not included with networkx - we have to build it ourselves.

To do this, we will use a file containing a list of comic books and the "heroes" that appear in them. This dataset is from https://www.kaggle.com/csanhueza/the-marvel-universe-social-network.

The data is lines containing "hero" and "comic book".

Forming Links

To link heroes, we will loop over the above dataframe, and do the following:

Once we have the data for each comic book, we can loop over the comic books and link heroes by the common comic books they appear in.

Some printout

Now lets print out the most common hero by count, and then for each of these, print out who they appear together with the most often.

Another way to view the data

The tables above are interesting, but it is a little difficult to tell if all of the heroes are just randomnly connected, or if there is some structure to these connections.

As with the Karate network above, we can learn some interesting thing about the relationships among our heroes if we form them into a network. Our network will have (at least) two primary features:

Form Lookup Tables

The following shows how to connect the found communities to the original list of heroes.

The resulting tables of the largest hero communities and their members are printed, and the results look really sensible!!

Drawing the Marvel Network

Now lets use the same tool to draw our network.

Application of graph analytics to real-world numerical datasets

The above analysis is interesting, but there are not many problems which we will encounter in a typical analysis environment which are like the above Marvel Universe. It is reasonable to ask if these tools might be applicable in other circumstances.

The idea of community detection seems like one that might have such broader application. Community detction is just a form of clustering, which itself is an example of "unsupervised" learning. Can we use community detection in a real work example? The answer is yes.

To do this we will use "fake" datasets (generated using sklearn tools), as well as some helper functions in the code block below. The helper fucntions will allow us to visualize the generated datasets.

Generating Fake Data

sklearn comes with a powerful tool for generating datasets call "make_classification". The resulting generated datasets can then be used to test other algorithms such as classification and regression. We will use this to test the clustering ability of graph analytics algorithms.

The important parameters for "make_classification" are the following:

Number of data points per class

Visualization of High Dimensional Data Using t-sne

Our data has 10 dimensions (from the 10 features we generated it with). How do we visualize it? We could make a network of this data (and we will do that below) and then do community detection and plot the resulting network as we did above. Instead, we will use a methoid called t-sne, to project our 10 dimensions down to 2 or 3.

From Wikipedia:

t-distributed stochastic neighbor embedding (t-SNE) is a statistical method for visualizing high-dimensional data by giving each datapoint a location in a two or three-dimensional map. It is based on Stochastic Neighbor Embedding originally developed by Sam Roweis and Geoffrey Hinton,[1] where Laurens van der Maaten proposed the t-distributed variant.[2] It is a nonlinear dimensionality reduction technique well-suited for embedding high-dimensional data for visualization in a low-dimensional space of two or three dimensions. Specifically, it models each high-dimensional object by a two- or three-dimensional point in such a way that similar objects are modeled by nearby points and dissimilar objects are modeled by distant points with high probability.

Our "visualize_3d" and "visualize_2d" functions already incorporate t-sne, if the dimension given to them are aove 3 or 2 respectively. We will see that our 4 generated samples appear as 4-separate clusters aftyer the application of t-sne.

Edges

We need some way to define edges, in oprder to be able to connect our datapoints. To do this, we will use a concept we introduced previously: cosine similarity. Our points each lie in an n-dimensional space (defined by the features). By calculating the cosine between each point, we can define a strength of the connection:

We are assuming that since the data is generated to come from different classes, that the cosine of points ffrom the same classes should be closer than points from different classes.

Compare cosine similarity for points

Look at points of the same class, versus points of different classes.

Define the connections

From the curve above, it looks like we can call connected points, those that have cosine>0.5.

With this definition, we can form a graph.

Run Community Detection

Connect Communities to Classes

The found communities are the same as the number of classes. Remember: this was totally unsupervised! Now we need to see if the communities actually correspond to the true classes.

This is a little more complicated than the marvel universe. In our case, each point already belongs to a class, and we want to know how often our classes are connected to the same community.

Assignment

Now let's apply this to a data sample we used previously: pulsars. What I want you to do is to

  1. Calculate the cosine similarity among all points in the pulsar dataset
  2. Plot the cosine similary for true pulsars (one class) vs non-pulsars (the other class).
  3. Plot the 3D visualization using TSNE.
  4. Run community detection and determine how many communities are found.

An extra credit portion will deal with mathing found communities to true classes.

The code below sets things up by reading the data in!

Assignment Task 1

Calculate the cosine similarity between all of the data points in the pulsar dataset. This is done almost exactly like we did above for our fake dataset.

Assignment Task 2

Plot the cosine similarity of the signal vs background datasets.

Assignment Task 3

Plot the 3D visualization using TSNE (using visualize_3d). This might take awhile since there are alot of points to plot.

Assignment Task 4: Run Community Detection

Run community detection on the pulsar dataset. Use a cut of 0.9 on cosine similarity, and see how many communities you get. Remember that this is an unsupervised task - the community detection algorithm does not know how many classes are in the same.

Make sure you print out how many classes are found.

Extra Credit 1: 4 points total

For the above pulsar dataset, after community detection, determine separately

Extra Credit 2: 5 points total

Use the Karate Network, and figure out how to provide a size to the nodes using a new version of the visualize_2d method.

You will have to modify the 2nd and 3rd blocks to accomplish this task.